Regulatory variants associated with COPD

Chronic Obstructive Pulmonary Disease (COPD) is a lung condition characterized by breathing difficulties. Previously it was studied the gene expression profiles of this disease and published in PulmonDB database. But it was unknown which variants cause the disease, which is the main objective, where I found the variants that change the expression of the previously analyzed genes.

To do that, I focused on retrieving the variants previously annotated for COPD genes and the disease, then analyzing them with Regulatory Sequence Analysis Tools (RSAT) to retrieve transcriptional factor motifs where the variants are affecting. Then I searched for genes close to the variants and filtered for those in a meta-analysis (realized by Ana Beatriz Villaseñor), which complemented the data from PulmonDB.

Figure 1:Flowchart of the tools used.

Retrieve variants

For the variant search, I used a database created by Lucia Ramirez, which included data from the Genome-Wide Association Study (GWAS), ClinVar, Genotype-Tissue Expression (GTEx) project, and the Single Nucleotide Polymorphism Database (dbSNP). I searched for variants that were associated with the disease and the genes from the meta-analysis from Ana B.

Disease associated variants

I started filtering the variants related to COPD, FEV1, FVC and Emphysema. Then, I created a file where I kept the variant ID, chromosome name, and the position. I retrieved 4,370 variants from here.

# I realized the filter with grep, using -E, which determines that I will be applying regular expression.       
# Also, I sorted my results with sort, using the parameters -n (so that its order in numerical order) and -k (to specify the reference column).  
# The parameter -F "\t" from awk specifies that the delimitation of columns is by tabs. 
grep -E "(COPD|FVC|FEV1|Chronic Obstructive Airway Disease|Chronic Lung Injury|Lung Diseases Obstructive|Emphysema)" Data/Gwas_Clinvar_Gtex_dbSNP_Joined.bed | awk -F"\t" '{OFS=FS}{if ($3 == "NA") $3 = $20}{if ($15 == "NA") $15 = $22}{if ($16 == "NA") $16 = $4}{if ($3 != "NA") print $1 "\t" $2 "\t" $3 "\t" $15 "\t" $16}' | sort -nk 2 | uniq > Data/disease-associated-variants.txt

Gene associated variants

To retrieve the variants associated to the genes, first I created a file with all the genes from the meta-analysis, with the corresponding filters (Tissue = qval.random <= 0.05 & I2 < 0.4 & num_exp >= 9 and blood = qval.random <= 0.05 & I2 < 0.4 & num_exp >= 4). I obtained 1,716 unique genes (161 from tissue and 1570 from blood).

Tissue_meta_analysis_f <- filter(Tissue_meta_analysis, qval.random <= 0.05 & I2 < 0.4 & num_exp >= 9)
Blood_meta_analysis_f <- filter(Blood_meta_analysis, qval.random <= 0.05 & I2 < 0.4 & num_exp >= 4)
write.csv(unique(c(Tissue_meta_analysis_f$X, Blood_meta_analysis_f$X)), "Data/meta-genes.txt", row.names = F, col.names = F, quote = F)

I searched for the variants associated with the final genes from the meta-analysis (1,716). I retrieved 70,134 variants from here.

# In this case, I use grep with -w (retrieves variants only if the whole word matches) and -f (specifies that it will be using a file in the search), and then I create the same file as with the diseases.  
grep -wf Data/meta-genes.txt Data/Gwas_Clinvar_Gtex_dbSNP_Joined.bed | awk -F"\t" '{OFS=FS}{if ($3 == "NA") $3 = $20}{if ($15 == "NA") $15 = $22}{if ($16 == "NA") $16 = $4}{if ($3 != "NA") print $1 "\t" $2 "\t" $3 "\t" $15 "\t" $16}' | sort -nk 2 | uniq > Data/genes-associated-variants.txt

Variants concatenation and filtering

I joined both of my files and end up with 74,142 variants. Then, I looked for the coordinates of the genes from UCSC genome browser and assure that the variants were close to the genes. By doing this I ended up with 59,648 variants to begin with the analysis.

Once I had all the variants ID I created the corresponding files for the analysis. I created a bed file in which I kept the chromosome name, the start position, and the end. I also needed a vcf file; I base the columns that I needed from the vcf manual. In this step, I also put on the corresponding header in each file.

cat Data/disease-associated-variants.txt Data/genes-associated-variants.txt | sort -nk 2 | uniq > Data/all_variants_NoFilter.txt
all_variants_NoFilter <- read.table("Data/all_variants_NoFilter.txt")
colnames(all_variants_NoFilter) <- c("Chr", "Pos", "ID", "Ref", "Alt")
all_variants_NoFilter$Chr <- gsub("chr", "", all_variants_NoFilter$Chr)
all_variants_Filter <- unique(unlist(sapply(c(1:nrow(UCSC_genes)), function(x){return(filter(all_variants_NoFilter, (Chr == UCSC_genes$chrom[x]) & (Pos >= (UCSC_genes$cdsStart[x]-5000)) & (Pos <= (UCSC_genes$cdsStart[x]+5000)))$ID)})))
writeLines(c("##fileformat=VCFv4.2","#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"), "Data/all_variants.vcf")
write.table(data.frame(all_variants_NoFilter[all_variants_NoFilter$ID %in% all_variants_Filter,], qual = ".", filter = ".", info = "."), "Data/all_variants.vcf", row.names = F, col.names = F, quote = F, sep = "\t", append = T)
bed <- all_variants_NoFilter[all_variants_NoFilter$ID %in% all_variants_Filter, c(1, 2)]
bed$start <- bed$Pos -1
writeLines(c("chrom \t chromStar \t chromEnd"), "Data/all_variants.bed")
write.table(bed[,c(1, 3, 2)], "Data/all_variants.bed", row.names = F, col.names = F, quote = F, sep = "\t", append = T)
# Need to remove R and Y, because RSAT don't recognize IUPAC annotation. 
# I create a file with the JASPAR TF IDs since I'll be needing this later.
grep '>' Data/JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt | tr -d '>' > Data/JASPAR_IDs.txt

Identify affected TFBS

To identify the affected TFBS. I analyze motif changes with variation tools from Regulatory Sequence Analysis Tools (RSAT).
First, I needed to create a background model, so I used fetch-sequences, which receives a BED file and collects the sequences from the UCSC genome browser. I chose to retrieve 100pb upstream and 100pb downstream from the variant. And then, given that sequence, I created my background model with the tool create-background-model.
I used convert-variations to retrieve a varBed file from a VCF; this is a format from RSAT that facilitates the processing of the information of the variants. Then I used retrieve-variation-seq, which gives you the sequences surrounding the variant (30pb). Finally, I used variation-scan, which provides you with the probability of having a new instance in the sequence from the variant given the background model binding site (created before); this is done by processing different variants with the JASPAR non-redundant motif database and comparing the scores and p-values to find the TFBS that are being affected.

# qsub varScan.sge
rsat fetch-sequences -i Data/all_variants.bed -o Results/RSAT/all_variants_bg.fasta -v 1 -downstr_ext 100 -upstr_ext 100 -genome hg38
rsat create-background-model -v 1 -i Results/RSAT/all_variants_bg.fasta -markov 2 -out_format transitions -o Results/RSAT/all_variants_bg_model.txt
rsat convert-variations -v 1 -i Data/all_variants.vcf -from vcf -to varBed -o Results/RSAT/all_variants.varBed
rsat retrieve-variation-seq -species Homo_sapiens -assembly GRCh38 -i  Results/RSAT/all_variants.varBed -format varBed -mml 30 -v 1 -o  Results/RSAT/all_variants.varSeq
rsat variation-scan -i  Results/RSAT/all_variants.varSeq -m /cm/shared/apps/rsat/rsat/public_html/motif_databases/JASPAR/Jaspar_2020/nonredundant/JASPAR2020_CORE_vertebrates_non-redundant_pfms.tf -m_format transfac -bg Results/RSAT/all_variants_bg_model.txt -o  Results/RSAT/all_variants_varScan.tsv -v 1 -mml 30 -lth score 1 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3

Once I had the results from variation-Scan, I analyzed them. To do that, I filtered just the ones that fulfill quality requirements, and using the JASPAR data, I added a column with the corresponding TF from the motif.

COPD_varScan <- read.csv("Results/RSAT/all_variants_varScan.tsv", header = T, sep = "\t", comment.char = ";") %>% filter(pval_ratio > 10 & (best_w == 0 | worst_w == 0 | (best_w * worst_w) < 0) & best_pval < 1e-4 ) %>% merge(JASPAR_IDs, by = "X.ac_motif", all.x = T)
COPD_varScan <- COPD_varScan[,c(1, 22, 2:21)]

To make sure that my variants were affecting the TF, I realized two different negative controls. One with vSampler and the other one with matrix permutations, both explained later. And I compute both of them for tissue and blood data independently.

Tissue negative control

For the tissue, I used the Tissue meta-analysis with the filtered previously mentioned. From them, I searched for the TFs on this data-set tissue data.

Tissue_COPD_varScan <- COPD_varScan[COPD_varScan$TF %in% Tissue_meta_analysis$X,]

vSampler

vSampler is a tool that receives the list of variants you’re using, and for each one of them retrieves random variants given eight different properties. This is used to compute a negative control.
To use vSampler, I needed to create a file with the chromosome name and the variant’s position. For that, I used only the significant ones from variationScan and created the input file.

Tissue_coords <- as.data.frame(matrix(unlist(str_split(unlist(str_split(unlist(str_split(Tissue_COPD_varScan$var_coord, ":")), "-")), "_")), ncol = 4, byrow = T))[,(1:2)]
write.table(Tissue_coords, file = "Data/Tissue_vSampler_Coord-Only.txt", append = F, sep = "\t", row.names = F, col.names = F, quote = F)
rm(Tissue_coords)

I ran vSampler online. And given the result list of variants, I created the vcf file and the bed file. Then, I ran all the same steps as before from RSAT-tools.

grep control1 Data/Tissue_vSampler/anno.out.txt | awk '{print $2 "\t" $3-1 "\t" $3}' | sed -e '1 i\chrom \t chromStar \t chromEnd' > Data/TissueNC_vSampler.bed
grep control1 Data/Tissue_vSampler/anno.out.txt | awk '{print $2 "\t" $3 "\t" "." "\t" $4 "\t" $5 "\t" "." "\t" "." "\t" "."}' |  sed -e '1 i\##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' > Data/TissueNC_vSampler.vcf
rsat fetch-sequences -i Data/TissueNC_vSampler.bed -o Results/RSAT/TissueNC_vSampler_bg.fasta -v 1 -downstr_ext 100 -upstr_ext 100 -genome hg38
rsat create-background-model -v 1 -i Results/RSAT/TissueNC_vSampler_bg.fasta -markov 2 -out_format transitions -o Results/RSAT/TissueNC_vSampler_bg_model.txt
rsat convert-variations -v 1 -i Data/TissueNC_vSampler.vcf -from vcf -to varBed -o Results/RSAT/TissueNC_vSampler.varBed
rsat retrieve-variation-seq -species Homo_sapiens -assembly GRCh38 -i Results/RSAT/TissueNC_vSampler.varBed -format varBed -mml 30 -v 1 -o Results/RSAT/TissueNC_vSampler.varSeq
rsat variation-scan -i Results/RSAT/TissueNC_vSampler.varSeq -m /cm/shared/apps/rsat/rsat/public_html/motif_databases/JASPAR/Jaspar_2020/nonredundant/JASPAR2020_CORE_vertebrates_non-redundant_pfms.tf -m_format transfac -bg  Results/RSAT/TissueNC_vSampler_bg_model.txt -o Results/RSAT/TissueNC_vSampler_varScan.tsv -v 1 -mml 30 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3

Matrix permutation

I retrieved the Jaspar PFM matrix from the previously filtered TFs and then saved it into a file.

Jaspar_Tissue_PFMatrix <- getMatrixByID(JASPAR2020, ID = Tissue_COPD_varScan$X.ac_motif)
write_transfac(Jaspar_Tissue_PFMatrix, "Data/TissueNC_PFMatrix.txt", overwrite = T)
rm(Jaspar_Tissue_PFMatrix)

Again I used some tools from RSAT to perform the negative control. To do that, I created five permutations of my matrices and then ran again variation-scan with the new matrix file.

# Because the transfac format has the name of the motif in AC, in order to function I needed to correct that in the matrix file.
cat Data/TissueNC_PFMatrix.txt | sed -e  's/NA/AC/' > Data/TissueNC_PFMatrix.tf
rm Data/TissueNC_PFMatrix.txt 
rsat permute-matrix -v 0 -in_format transfac -i Data/TissueNC_PFMatrix.tf -perm 5 -out_format transfac -o Results/RSAT/TissueNC_5perm_transfac.tf
rsat variation-scan -i Results/RSAT/all_variants.varSeq -m Results/RSAT/TissueNC_5perm_transfac.tf -m_format transfac -bg Results/RSAT/all_variants_bg_model.txt -o Results/RSAT/tissueNC_5perm_varScan.tsv -v 1 -mml 30 -lth score 1 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3

Tissue negative control distribution analysis

To scan both variation-scan results from both of my controls independently, I calculated the distribution of each motif’s pval ratio, which is the ratio between the worst pval and the best one (the bigger the ratio, the most significant it will be). Then, I saw the distribution on the control motif for each variant and kept just the significant ones.

# I started by loading the results from variation-scan from vSampler and created the corresponding data to create the distribution. 
TissueNC_vSampler_varScan <- read.csv("Results/RSAT/TissueNC_vSampler_varScan.tsv", header = T, sep = "\t", comment.char = ";")
TissueNC_vSampler_varScan <- merge(TissueNC_vSampler_varScan, JASPAR_IDs, by = "X.ac_motif", all.x = T)
TissueNC_vSampler_varScan <- TissueNC_vSampler_varScan[,c(1, 22, 2:21)]
# I did the same with the results from variation-scan of the permutations. 
TissueNC_5perm_varScan <- read.csv("Results/RSAT/TissueNC_5perm_varScan.tsv", header = T, sep = "\t", comment.char = ";")
TissueNC_5perm_varScan$perm <- unlist(as.data.frame(matrix(unlist(str_split(TissueNC_5perm_varScan$X.ac_motif, "_")), ncol = 2, byrow = T))[2])
TissueNC_5perm_varScan$X.ac_motif <- unlist(as.data.frame(matrix(unlist(str_split(TissueNC_5perm_varScan$X.ac_motif, "_")), ncol = 2, byrow = T))[1])
TissueNC_5perm_varScan <- merge(TissueNC_5perm_varScan, JASPAR_IDs, by = "X.ac_motif", all.x = T)
TissueNC_5perm_varScan <- TissueNC_5perm_varScan[,c(1, 23, 2:22)]
# Using some functions created before I realize the distributions. 
Tissue_vSampler_distribution <- unlist(sapply(X = unique(Tissue_COPD_varScan$var_id), function(x){distribution_analysis(x, Tissue_COPD_varScan, TissueNC_vSampler_varScan)}))
Tissue_vSampler_distribution <- Tissue_vSampler_distribution[Tissue_vSampler_distribution < 0.05]
Tissue_vSampler_distribution <- cbind(as.data.frame(matrix(unlist(str_split(names(Tissue_vSampler_distribution), "\\.")), ncol = 2, byrow = T)), unlist(Tissue_vSampler_distribution))
colnames(Tissue_vSampler_distribution) <- c("var_id", "TF", "pvalue")
Tissue_perm_distribution <- unlist(sapply(X = unique(Tissue_COPD_varScan$var_id), function(x){distribution_analysis(x, Tissue_COPD_varScan, TissueNC_5perm_varScan)}))
Tissue_perm_distribution <- Tissue_perm_distribution[Tissue_perm_distribution < 0.05]
Tissue_perm_distribution <- cbind(as.data.frame(matrix(unlist(str_split(names(Tissue_perm_distribution), "\\.")), ncol = 2, byrow = T)), unlist(Tissue_perm_distribution))
colnames(Tissue_perm_distribution) <- c("var_id", "TF", "pvalue")
# Filter from my original data just the ones that were significant on the distributions given my negatives controls. 
Tissue_significant_var <- unique(rbind(Tissue_perm_distribution, Tissue_vSampler_distribution)[c("var_id", "TF")])
Tissue_significant_varScan <- merge(Tissue_significant_var, Tissue_COPD_varScan, by = c("var_id", "TF"))
Tissue_significant_varScan  <- Tissue_significant_varScan[,c(3, 2, 1, 4:22)]
rm(TissueNC_vSampler_varScan)
rm(TissueNC_5perm_varScan)
rm(Tissue_vSampler_distribution)
rm(Tissue_perm_distribution)

Blood negative control

I used the blood meta-analysis realized by Ana B and filtered just the TFs on the blood data.

Blood_COPD_varScan <- COPD_varScan[COPD_varScan$TF %in% Blood_meta_analysis$X,]

vSampler

Here I did the same as with the tissue data, so I created the corresponding file with the variant’s position and ran vSampler online.

Blood_coords <- as.data.frame(matrix(unlist(str_split(unlist(str_split(unlist(str_split(Blood_COPD_varScan$var_coord, ":")), "-")), "_")), ncol = 4, byrow = T))[,(1:2)]
write.table(Blood_coords, file = "Data/Blood_vSampler_Coord-Only.txt", append = F, sep = "\t", row.names = F, col.names = F, quote = F)
rm(Blood_coords)

Given my result list of variants, I created the vcf file and the bed file. Then, I ran all the steps from RSAT-tools.

grep control1 Data/Blood_vSampler/anno.out.txt | awk '{print $2 "\t" $3-1 "\t" $3}' | sed -e '1 i\chrom \t chromStar \t chromEnd' > Data/BloodNC_vSampler.bed
grep control1 Data/Blood_vSampler/anno.out.txt | awk '{print $2 "\t" $3 "\t" "." "\t" $4 "\t" $5 "\t" "." "\t" "." "\t" "."}' |  sed -e '1 i\##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' > Data/BloodNC_vSampler.vcf
rsat fetch-sequences -i Data/BloodNC_vSampler.bed -o Results/RSAT/BloodNC_vSampler_bg.fasta -v 1 -downstr_ext 100 -upstr_ext 100 -genome hg38
rsat create-background-model -v 1 -i Results/RSAT/BloodNC_vSampler_bg.fasta -markov 2 -out_format transitions -o Results/RSAT/BloodNC_vSampler_bg_model.txt
rsat convert-variations -v 1 -i Data/BloodNC_vSampler.vcf -from vcf -to varBed -o Results/RSAT/BloodNC_vSampler.varBed
rsat retrieve-variation-seq -species Homo_sapiens -assembly GRCh38 -i Results/RSAT/BloodNC_vSampler.varBed -format varBed -mml 30 -v 1 -o Results/RSAT/BloodNC_vSampler.varSeq
rsat variation-scan -i Results/RSAT/BloodNC_vSampler.varSeq -m /cm/shared/apps/rsat/rsat/public_html/motif_databases/JASPAR/Jaspar_2020/nonredundant/JASPAR2020_CORE_vertebrates_non-redundant_pfms.tf -m_format transfac -bg  Results/RSAT/BloodNC_vSampler_bg_model.txt -o Results/RSAT/BloodNC_vSampler_varScan.tsv -v 1 -mml 30 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3

Matrix permutation

Just as with tissue, I retrieved the Jaspar PFM matrix and then ran the corresponding tools from RSAT.

Jaspar_Blood_PFMatrix <- getMatrixByID(JASPAR2020, ID = Blood_COPD_varScan$X.ac_motif)
write_transfac(Jaspar_Blood_PFMatrix, "Data/BloodNC_PFMatrix.txt", overwrite = T)
rm(Jaspar_Blood_PFMatrix)
# Because the transfac format has the name of the motif in AC, in order to function I needed to correct that in the matrix file.
cat Data/BloodNC_PFMatrix.txt | sed -e  's/NA/AC/' > Data/BloodNC_PFMatrix.tf
rm Data/BloodNC_PFMatrix.txt 
rsat permute-matrix -v 0 -in_format transfac -i Data/BloodNC_PFMatrix.tf -perm 5 -out_format transfac -o Results/RSAT/BloodNC_5perm_transfac.tf
rsat variation-scan -i Results/RSAT/all_variants.varSeq -m Results/RSAT/BloodNC_5perm_transfac.tf -m_format transfac -bg Results/RSAT/all_variants_bg_model.txt -o Results/RSAT/BloodNC_5perm_varScan.tsv -v 1 -mml 30 -lth score 1 -lth w_diff 1 -lth pval_ratio 10 -uth pval 1e-3

Blood negative control distribution analysis

I ran the same functions but with my blood results to analyze the distribution on the control motif for each variant and kept just the significant ones.

# I started by loading the results from variation-scan from vSampler and created the corresponding data to create the distribution. 
BloodNC_vSampler_varScan <- read.csv("Results/RSAT/BloodNC_vSampler_varScan.tsv", header = T, sep = "\t", comment.char = ";")
BloodNC_vSampler_varScan <- merge(BloodNC_vSampler_varScan, JASPAR_IDs, by = "X.ac_motif", all.x = T)
BloodNC_vSampler_varScan <- BloodNC_vSampler_varScan[,c(1, 22, 2:21)]
# I did the same with the results from variation-scan of the permutations. 
BloodNC_5perm_varScan <- read.csv("Results/RSAT/BloodNC_5perm_varScan.tsv", header = T, sep = "\t", comment.char = ";")
BloodNC_5perm_varScan$perm <- unlist(as.data.frame(matrix(unlist(str_split(BloodNC_5perm_varScan$X.ac_motif, "_")), ncol = 2, byrow = T))[2])
BloodNC_5perm_varScan$X.ac_motif <- unlist(as.data.frame(matrix(unlist(str_split(BloodNC_5perm_varScan$X.ac_motif, "_")), ncol = 2, byrow = T))[1])
BloodNC_5perm_varScan <- merge(BloodNC_5perm_varScan, JASPAR_IDs, by = "X.ac_motif", all.x = T)
BloodNC_5perm_varScan <- BloodNC_5perm_varScan[,c(1, 23, 2:22)]
# Using some functions created before I realize the distributions. 
Blood_vSampler_distribution <- unlist(sapply(X = unique(Blood_COPD_varScan$var_id), function(x){distribution_analysis(x, Blood_COPD_varScan, BloodNC_vSampler_varScan)}))
Blood_vSampler_distribution <- Blood_vSampler_distribution[Blood_vSampler_distribution < 0.05]
Blood_vSampler_distribution <- cbind(as.data.frame(matrix(unlist(str_split(names(Blood_vSampler_distribution), "\\.")), ncol = 2, byrow = T)), unlist(Blood_vSampler_distribution))
colnames(Blood_vSampler_distribution) <- c("var_id", "TF", "pvalue")
Blood_perm_distribution <- unlist(sapply(X = unique(Blood_COPD_varScan$var_id), function(x){distribution_analysis(x, Blood_COPD_varScan, BloodNC_5perm_varScan)}))
Blood_perm_distribution <- Blood_perm_distribution[Blood_perm_distribution < 0.05]
Blood_perm_distribution <- cbind(as.data.frame(matrix(unlist(str_split(names(Blood_perm_distribution), "\\.")), ncol = 2, byrow = T)), unlist(Blood_perm_distribution))
colnames(Blood_perm_distribution) <- c("var_id", "TF", "pvalue")
# Filter from my original data just the ones that were significant on the distributions given my negatives controls. 
Blood_significant_var <- unique(rbind(Blood_perm_distribution, Blood_vSampler_distribution)[c("var_id", "TF")])
Blood_significant_varScan <- merge(Blood_significant_var, Blood_COPD_varScan, by = c("var_id", "TF"))
Blood_significant_varScan  <- Blood_significant_varScan[,c(3, 2, 1, 4:22)]
rm(BloodNC_vSampler_varScan)
rm(BloodNC_5perm_varScan)
rm(Blood_vSampler_distribution)
rm(Blood_perm_distribution)

Gene association

Given my results I associated each variant with the closest gene given the meta-analysis realized by Ana B.

I’ll be using the Transcripts per Million (TPM) from the Genotype-Tissue Expression (GTEx) to select just the TFs that are active in normal conditions. So the next code is to select the data that I’ll be using later on.

awk -F "\t" '{print $1 "\t" $2 "\t" $39 "\t" $56}' GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct > GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm_lung_blood.csv

Variant Effect Predictor

To do the gene search I used Ensembl Variant Effect Predictor (VEP) from ensembl and I looked for 5000 pb upstream and 1000 pb downstream from the variant.

# Here I create a file with the variant id so that I can create a vcf file just with the variants that I'll be using for my tissue and blood data. 
write.table(unique(Tissue_significant_varScan$var_id), "Results/Tissue_significant_varID_varScan.txt", sep = "\t", quote = F, col.names = F, row.names = F)
# grep -wf Results/Tissue_significant_varID_varScan.txt Data/all_variants.vcf > Results/Tissue_significant_varID_varScan.vcf
write.table(unique(Blood_significant_varScan$var_id), "Results/Blood_significant_varID_varScan.txt", sep = "\t", quote = F, col.names = F, row.names = F)
# grep -wf Results/Blood_significant_varID_varScan.txt Data/all_variants.vcf > Results/Blood_significant_varID_varScan.vcf

I ran VEP online and download my results in R.

Tissue VEP results

To know if the effect is really happening between the genes and the TFs, I search in GTEx the Transcripts Per Million (TPM) of lung and blood and remove all the genes and TFs that had a TPM of 0, from my results of VEP.

Tissue_VEP <- read.csv("Results/Tissue_VEP.txt", sep = "\t")
Tissue_VEP <- Tissue_VEP[Tissue_VEP$SYMBOL %in% Tissue_meta_analysis_f$X,]
Tissue_data <- merge(data.frame(merge(Tissue_meta_analysis, Tissue_VEP, by.x = "X", by.y = "SYMBOL" )), Tissue_significant_varScan, by.x = "X.Uploaded_variation", by.y = "var_id")
Tissue_data$TF.TE.random <- sapply(Tissue_data$TF, function(x){return(as.numeric(filter(Tissue_meta_analysis, X == x)$TE.random))})
Tissue_data <- Tissue_data[,c(1, 59, 53, 50, 51, 71, 2, 9, 3, 5, 6, 12:14, 19, 20, 29, 41)]
Tissue_data <- unique(Tissue_data)
colnames(Tissue_data) <- c("var_id", "varScan_pval.ratio", "var_coord", "motif", "TF", "TF_TE.random", "gene", "gene_qval.random", "gene_TE.random", "gene_tau2", "gene_I2", "ALT", "Consequence", "IMPACT", "EXON", "INTRON", "DISTANCE", "CLIN_SIG")
# GTEx filter
GTEx_tissue_data <- rbind(GTEx_data[GTEx_data$Description %in% Tissue_data$TF,], GTEx_data[GTEx_data$Description %in% Tissue_data$gene,])
Tissue_data <- Tissue_data[Tissue_data$TF %in% GTEx_tissue_data[GTEx_tissue_data$Lung != 0,]$Description,]
Tissue_data <- Tissue_data[Tissue_data$gene %in% GTEx_tissue_data[GTEx_tissue_data$Lung != 0,]$Description,]
head(Tissue_data) %>% htmlTable
var_id varScan_pval.ratio var_coord motif TF TF_TE.random gene gene_qval.random gene_TE.random gene_tau2 gene_I2 ALT Consequence IMPACT EXON INTRON DISTANCE CLIN_SIG
1 rs1002699754 31.96 2:61017812-61017813_+ MA0137.3 STAT1 0.0379079596826783 PEX13 0.0125637541750527 0.115720512184952 0.00308264978286661 0.379060316925231 T synonymous_variant LOW 1/4 - - -
3 rs1002699754 31.96 2:61017812-61017813_+ MA0137.3 STAT1 0.0379079596826783 PEX13 0.0125637541750527 0.115720512184952 0.00308264978286661 0.379060316925231 C synonymous_variant LOW 1/4 - - -
5 rs1002699754 31.96 2:61017812-61017813_+ MA0137.3 STAT1 0.0379079596826783 PEX13 0.0125637541750527 0.115720512184952 0.00308264978286661 0.379060316925231 C synonymous_variant LOW 2/3 - - -
7 rs1002699754 31.96 2:61017812-61017813_+ MA0137.3 STAT1 0.0379079596826783 PEX13 0.0125637541750527 0.115720512184952 0.00308264978286661 0.379060316925231 T synonymous_variant LOW 1/2 - - -
9 rs1002699754 31.96 2:61017812-61017813_+ MA0137.3 STAT1 0.0379079596826783 PEX13 0.0125637541750527 0.115720512184952 0.00308264978286661 0.379060316925231 A synonymous_variant LOW 1/4 - - uncertain_significance
11 rs1002699754 31.96 2:61017812-61017813_+ MA0137.3 STAT1 0.0379079596826783 PEX13 0.0125637541750527 0.115720512184952 0.00308264978286661 0.379060316925231 A synonymous_variant LOW 2/3 - - uncertain_significance

Figure 2

Figure 2: CCL2

Blood VEP results

I did the same for my blood results. Since I got a huge list from VEP that were positive, I did an extra filter. So, I looked for the coordinates of the genes from UCSC genome browser and assure that the gene was between my variant. Doing this filter helped me to remove the ones that weren’t close to the gene.

Blood_VEP <- read.csv("Results/Blood_VEP.txt", sep = "\t")
Blood_VEP <- Blood_VEP[Blood_VEP$SYMBOL %in% Blood_meta_analysis_f$X,]
Blood_data <- merge(data.frame(merge(Blood_meta_analysis, Blood_VEP, by.x = "X", by.y = "SYMBOL" )), Blood_significant_varScan, by.x = "X.Uploaded_variation", by.y = "var_id")
Blood_data$TF.TE.random <- sapply(Blood_data$TF, function(x){return(filter(Blood_meta_analysis, X == x)$TE.random)})
Blood_data <- Blood_data[,c(1, 59, 53, 50, 51, 71, 2, 9, 3, 5, 6, 12:14, 19, 20, 29, 41)]
Blood_data <- unique(Blood_data)
colnames(Blood_data) <- c("var_id", "varScan_pval.ratio", "var_coord", "motif", "TF", "TF_TE.random", "gene", "gene_qval.random", "gene_TE.random", "gene_tau2", "gene_I2", "ALT", "Consequence", "IMPACT", "EXON", "INTRON", "DISTANCE", "CLIN_SIG")
# UCSC genes
Blood_UCSC_genes <- UCSC_genes[UCSC_genes$name2 %in% Blood_meta_analysis_f$X,]
Blood_varScan_coord <- as.data.frame(matrix(unlist(str_split(unlist(str_split(unlist(str_split(Blood_significant_varScan$var_coord, ":")), "-")), "_")), ncol = 4, byrow = T)) 
Blood_genes <- unique(unlist(sapply(c(1:nrow(Blood_significant_varScan)), function(x){return(filter(Blood_UCSC_genes, (chrom == Blood_varScan_coord$V1[x]) & ((cdsStart-5000) <= Blood_varScan_coord$V2[x]) & ((cdsStart+5000) >= Blood_varScan_coord$V3[x]))$name2)})))
Blood_data <- Blood_data[Blood_data$gene %in% Blood_genes,]
# GTEx filter
GTEx_blood_data <- rbind(GTEx_data[GTEx_data$Description %in% Blood_data$TF,], GTEx_data[GTEx_data$Description %in% Blood_data$gene,])
Blood_data <- Blood_data[Blood_data$TF %in% GTEx_blood_data[GTEx_blood_data$`Whole Blood` != 0,]$Description,]
Blood_data <- Blood_data[Blood_data$gene %in% GTEx_blood_data[GTEx_blood_data$`Whole Blood` != 0,]$Description,]
head(Blood_data) %>% htmlTable
var_id varScan_pval.ratio var_coord motif TF TF_TE.random gene gene_qval.random gene_TE.random gene_tau2 gene_I2 ALT Consequence IMPACT EXON INTRON DISTANCE CLIN_SIG
1 rs10015223 76 4:4387146-4387147_+ MA0691.1 TFAP4 -0.0258832292174166 NSG1 0.0166089583403988 -0.205856808798565 0.00156274318763709 0 G 5_prime_UTR_variant MODIFIER 2/6 - - -
2 rs10015223 29.09 4:4387146-4387147_+ MA0691.1 TFAP4 -0.0258832292174166 NSG1 0.0166089583403988 -0.205856808798565 0.00156274318763709 0 G 5_prime_UTR_variant MODIFIER 2/6 - - -
3 rs10015223 76 4:4387146-4387147_+ MA0691.1 TFAP4 -0.0258832292174166 NSG1 0.0166089583403988 -0.205856808798565 0.00156274318763709 0 G 5_prime_UTR_variant MODIFIER 1/5 - - -
4 rs10015223 29.09 4:4387146-4387147_+ MA0691.1 TFAP4 -0.0258832292174166 NSG1 0.0166089583403988 -0.205856808798565 0.00156274318763709 0 G 5_prime_UTR_variant MODIFIER 1/5 - - -
5 rs10015223 76 4:4387146-4387147_+ MA0691.1 TFAP4 -0.0258832292174166 NSG1 0.0166089583403988 -0.205856808798565 0.00156274318763709 0 G 5_prime_UTR_variant MODIFIER 4/8 - - -
6 rs10015223 29.09 4:4387146-4387147_+ MA0691.1 TFAP4 -0.0258832292174166 NSG1 0.0166089583403988 -0.205856808798565 0.00156274318763709 0 G 5_prime_UTR_variant MODIFIER 4/8 - - -

There’re 1964 variants. I ran all of them in VEP, and I obtain the same 1964 variants, associated with 966 genes. From there I first remove the ones that weren’t in our filter lood meta-analysis (1570 genes) and retrieve 427 genes. Then I did another filter with the UCSC, removing all the genes that aren’t close to the variants, we obtain 386 genes. After that, we remove the TFs that in normal contidions are not being expressed (TPM == 0), so finally we find 354 significant genes.

If I do the analysis without VEP, and only search for the genes that are close to the variants, I obtain 394 genes. The main difference is that with VEP I obtain more information about the association.

Result analysis

Now that I have all the results, I’m going to be doing the final tables to be able to interpret my results better. To do that I’ll start with some tables that will have different information all together and then some figures with the best association.

Variants and TFs for each gene (Table 1)

gene_variants_tfs_function <- function(x, data, s){
  gene_data <- data %>% filter(gene == x)
  return(c(x, toString(unique(gene_data$var_id)), toString(unique(gene_data$TF)), s))
}
gene_variants_tfs <- rbind(as.data.frame(t(sapply(X = unique(Tissue_data$gene), function(x){return(gene_variants_tfs_function(x, Tissue_data, "tissue"))}))), as.data.frame(t(sapply(X = unique(Blood_data$gene), function(x){return(gene_variants_tfs_function(x, Blood_data, "blood"))}))))
rownames(gene_variants_tfs) <- NULL
colnames(gene_variants_tfs) <- c("gene", "variants", "TF", "analysis")
write.table(gene_variants_tfs, file = "Results/Tables/Gene_variants_tfs.csv", row.names = F, quote = F, sep = "\t")
head(gene_variants_tfs) %>% htmlTable
gene variants TF analysis
1 PEX13 rs1002699754, rs771610641 STAT1, ZSCAN4 tissue
2 HBB rs1005042281, rs1055395965, rs1141370, rs1160543272, rs1250042245, rs1302097141, rs1320832744, rs1554917555, rs193922551, rs193922558, rs33913413, rs33919924, rs33922842, rs33922873, rs33925391, rs33926796, rs33927093, rs33929459, rs33930385, rs33931779, rs33931806, rs33932070, rs33933298, rs33937535, rs33941377, rs33944208, rs33945546, rs33946157, rs33946775, rs33951978, rs33952266, rs33952543, rs33953406, rs33954595, rs33958088, rs33958637, rs33959340, rs33961444, rs33974228, rs33974936, rs33976006, rs33978082, rs33978338, rs33985510, rs33987957, rs33990858, rs33991294, rs33991472, rs33991993, rs33994806, rs34095019, rs34126315, rs34227486, rs34515413, rs34726542, rs34999973, rs35286210, rs35328027, rs35747961, rs35890959, rs63750400, rs63750628, rs63750840, rs713040, rs760975738, rs901033731 HNF1B, TEAD3, GCM1, HOXD10, HOXC10, HOXC9, POU2F2, ATF3, ZNF24, LHX9, LMX1A, TLX2, LHX6, EMX2, HOXB3, NKX6-1, HESX1, PRRX2, LBX2, NKX6-2, DRGX, PRRX1, MIXL1, ZEB1, STAT3, RHOXF1, ZNF75D, OSR1, ZNF274, NR6A1, BARX1, ESR2, NR3C2, HOXA2, HOXD3, HOXD4, HOXA5, HOXB5, HOXB4, HOXA4, HOXC4, NR1I3, NKX2-8, NR5A1, TFAP2E, FOSL2, ASCL1, SOHLH2, MEIS2, MEIS1, KLF5, KLF6, KLF2, YY1, TFAP2B, CTCF, TBX6, TBX3, TBX15, TBX5, ELK4, ELF4, ELF5, DMRT3, JUN, MAX, TFEB, MLXIPL, FOS, HOXD8, PPARD, RXRB, RXRG, CREB3L1, RFX4, RFX5, SNAI3, OLIG1, TFEC, HIF1A, RFX2, HOXD9, CDX1, HOXA10, GFI1, MYB, ZBTB18, ZNF135 tissue
3 WFS1 rs1014892217 NR2C1, NR2F1, ESR2 tissue
4 CENPM rs1023497, rs5758511 HSF2, EGR3 tissue
5 ACADVL rs1023643662, rs1032857886, rs1057523521, rs112406105, rs118204015, rs139425622, rs141167669, rs144255994, rs1458941582, rs1555526667, rs1555527732, rs1567565417, rs374911841, rs377044444, rs387906253, rs398123081, rs398123083, rs545215807, rs727503788, rs749159573, rs771874163, rs775761275, rs77763289, rs777751102, rs780020193, rs796051907, rs796051918, rs886044100, rs890862631 KLF5, MAZ, HIC2, ESR2, BARHL1, RHOXF1, ASCL1, TFAP4, NKX2-8, TCF7L2, TFAP2E, GCM1, KLF2, KLF6, NR2C1, TGIF2, TBX4, TBX2, NR2F1, TBX3, TBX6, ZFP42, NR3C2, NR3C1, FIGLA, OLIG1, NR5A1, NHLH2, NHLH1, FOS, GATA5, SNAI1, ETV4, NFATC4 tissue
6 CCL2 rs1024611 TFAP4 tissue

variant, gene and TFs association (Table 2)

variant_gene_tf <- rbind(cbind(unique(Tissue_data[,c(1, 5, 6, 7, 9)]), analysis = rep("tissue", times = nrow(unique(Tissue_data[,c(1, 5, 6, 7, 9)])))), cbind(unique(Blood_data[,c(1, 5, 6, 7, 9)]), analysis = rep("blood", times = nrow(unique(Blood_data[,c(1, 5, 6, 7, 9)])))))
write.table(variant_gene_tf, file = "./Results/Tables/Variant_gene_tf.csv", row.names = F, quote = F, sep = "\t")
head(variant_gene_tf) %>% htmlTable()
var_id TF TF_TE.random gene gene_TE.random analysis
1 rs1002699754 STAT1 0.0379079596826783 PEX13 0.115720512184952 tissue
33 rs1005042281 HNF1B -0.0172428699479146 HBB 0.179282484897724 tissue
103 rs1014892217 NR2C1 -0.0913787642762126 WFS1 -0.201667114660007 tissue
104 rs1014892217 NR2F1 -0.100398737715935 WFS1 -0.201667114660007 tissue
106 rs1014892217 ESR2 0.0575311120768812 WFS1 -0.201667114660007 tissue
233 rs1023497 HSF2 -0.0350147089725436 CENPM 0.0964497445898264 tissue

TP53 data

TP53_consequences <-as.factor(filter(Blood_data, gene == "TP53")$Consequence)
TP53_function <- function(x, datas){
  return(c(x, nrow(filter(datas, Consequence == x)), length(unique(filter(datas, Consequence == x)$TF)), length(unique(filter(datas, Consequence == x)$var_id))))
}

TP53_data <- unique(filter(Blood_data, gene == "TP53")[,c(1, 5, 13)]) 
TP53_consequences <- as.data.frame(base::table(unlist(TP53_data$Consequence)))
TP53_other <- TP53_data[TP53_data$Consequence %in% unlist(TP53_consequences[TP53_consequences$Freq < 5,]$Var1),]
TP53_other <- data.frame(Consequence = "other", num_consequence = nrow(TP53_other), num_TFs = length(unique(TP53_other$TF)), num_var = length(unique(TP53_other$var_id)))
TP53_NoOther <- TP53_data[!TP53_data$Consequence %in% unlist(TP53_consequences[TP53_consequences$Freq < 5,]$Var1),]
TP53_NoOther <- as.data.frame(t(sapply(X = unique(TP53_NoOther$Consequence), function(x){TP53_function(x, TP53_NoOther)})))
colnames(TP53_NoOther) <- c("Consequence", "num_consequence", "num_TFs", "num_var")
TP53_consequences <- rbind(TP53_NoOther, TP53_other)
TP53_consequences <- data.frame(consequence = rep(TP53_consequences$Consequence, each = 3), num = rep(c("num_consequence", "num_TFs", "num_var"), times = nrow(TP53_consequences)), value = as.numeric(unlist(t(TP53_consequences[c(2:4)]))))
ggsave( filename = "./Results/Plots/TP53_consequences.png",
ggplot(data=TP53_consequences, aes(x = reorder(consequence, value), y=value, fill=num)) +
    geom_bar(stat="identity", position=position_dodge()) + theme(axis.text.y = element_text(size = 8)) + coord_flip() + labs(title = "TP53 consequences", y = "Appearances", x ="Consequence")
)
## Saving 8 x 5 in image

Figure 4: TP53 consequences